
feat(quasicryth-research): direct C→Rust transcode + COW radix trie variant #461

Merged: AdaWorldAPI merged 7 commits into main from claude/splat3d-cpu-simd-renderer-MAOO0 on Jun 4, 2026

Conversation

@AdaWorldAPI AdaWorldAPI commented Jun 4, 2026

Summary

Direct Rust transcode of Quasicryth (Tacconelli 2026, arxiv 2603.14999, upstream github.com/robtacconelli/quasicryth v5.6.0) in two architectural variants behind one trait: the original flat-storage codebook from the C reference, and a Copy-on-Write Adaptive Radix Tree variant that fits this workspace's append-only substrate doctrine.

New excluded crate crates/quasicryth-research/ — standalone, zero-dep, follows the helix / bgz17 / deepnsm convention.

7 phases (0 through 6), 6 commits

| Commit | Phase | Modules | LOC | Tests |
|---|---|---|---|---|
| f0dfe88 | 0 | tiling + hierarchy + constants + types (from fib.c) | 1,300 | 28 |
| 68f754e | 1 | md5 + tok (RFC 1321 + word tokenization) | +650 | +20 |
| 9e229d5 | 2 | codebook trait + FlatCodebook + CowRadixCodebook + CowArt | +740 | +8 |
| afd7969 | 3 | arith_coder (Model256, VModel, Encoder, Decoder) | +640 | +10 |
| de566f6 | 4 | pipeline (compress, decompress, Variant) | +460 | +10 |
| 7fed9b9 | 5+6 | cross-variant integration + CLI binary | +400 | +7 |

Total: ~4,160 LOC Rust, 83 tests passing, cargo clippy -- -D warnings clean (pedantic + all), cargo fmt clean. Zero dependencies. No unsafe. Stable Rust.

Two variants behind one trait

pub trait Codebook: Send + Sync {
    fn n_unique(&self) -> u32;
    fn n_uni(&self) -> u32;
    fn unigram_index(&self, word_id: u32) -> Option<u32>;
    fn unigram_word(&self, idx: u32) -> Option<u32>;
    // ... bigram + n-gram methods
}

| Property | FlatCodebook | CowRadixCodebook |
|---|---|---|
| Storage | flat Vec<u32> + HashMap | ART (Node4 / Node16 / Node256) per tier |
| Lookup | O(1) avg | O(key_len) walk |
| Versioning | none | path-copy COW: every insert returns a new root, prior roots stay valid |
| Append-only fit | no | yes (fits workspace substrate doctrine) |
| Threading | Send + Sync | Send + Sync (Arc-shared subtrees) |

The COW property is explicitly tested in codebook::tests::cow_art_path_copy_preserves_old_root: art_v0 stays empty after art_v1.insert(...) and art_v2.insert(...). Tests also verify that the two variants agree on lookups (cow_radix_codebook_agrees_with_flat_on_lookups) and on end-to-end decompressed output (variants_produce_same_decompressed_output, cross_variant_independence).
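The path-copy behaviour that test checks can be sketched in a few lines. This is not the crate's CowArt: `TrieNode` is a hypothetical stand-in that keeps a `BTreeMap` per node instead of Node4 / Node16 / Node256, and it only illustrates why an old root stays valid after an insert.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Hypothetical persistent byte trie (illustration only, not the PR's CowArt).
#[derive(Clone, Default)]
pub struct TrieNode {
    value: Option<u32>,
    children: BTreeMap<u8, Arc<TrieNode>>,
}

impl TrieNode {
    /// Returns a new root with `key -> value` set. Only the nodes along
    /// `key` are cloned; every other subtree stays shared through `Arc`.
    pub fn insert(&self, key: &[u8], value: u32) -> Arc<TrieNode> {
        // Shallow clone: child pointers are reference-counted, not deep-copied.
        let mut copy = self.clone();
        match key.split_first() {
            None => copy.value = Some(value),
            Some((&byte, rest)) => {
                let child = copy.children.get(&byte).cloned().unwrap_or_default();
                copy.children.insert(byte, child.insert(rest, value));
            }
        }
        Arc::new(copy)
    }

    /// Walks the key byte by byte; O(key length).
    pub fn get(&self, key: &[u8]) -> Option<u32> {
        match key.split_first() {
            None => self.value,
            Some((byte, rest)) => self.children.get(byte)?.get(rest),
        }
    }
}
```

Starting from an empty root `v0`, `v1 = v0.insert(b"the", 0)` and `v2 = v1.insert(b"then", 1)` leave `v0` empty and `v1` without "then", which mirrors the v0 / v1 / v2 assertion described above.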

What round-trips end-to-end

pipeline::compress(text, Variant::Flat | Variant::CowRadix) → bytes and pipeline::decompress(bytes) → text round-trip on every test input, including:

  • empty, single word, whitespace-only
  • mixed case (Hello WORLD foo)
  • punctuation, newlines, tabs, quotes, parens, hyphens
  • repeated phrases, 5 KB cyclic text, pseudo-random English
  • UTF-8 high-bit (café naïve façade)
  • 600-character Fibonacci-theory paragraph

Both variants produce identical decompressed output (compressed bytes may differ).

Paper-theorem verification (algebraic substrate)

tests/paper_theorems.rs verifies, on synthetic L/S sequences:

  • Thm 2 Fibonacci hierarchy never collapses
  • Cor 4 Period-5 collapses by level 4 or 5 (vs Fibonacci's unbounded depth)
  • Thm 9 Golden Compensation: L:S ratio = φ at every level
  • Thm 13 / Cor 15 Aperiodic advantage grows with corpus scale
  • Sturmian Factor complexity ≤ n+1 (the minimality property behind maximal codebook efficiency)
  • PV-property (φ² = φ+1), HIER_WORD_LENS = F_3..F_12, no-adjacent-S on all 36 canonical tilings
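
Several of these properties can be checked on a plain Fibonacci word without any of the crate's types. The sketch below is an illustration only: the function names are hypothetical, tiles are stored as the bytes `L` and `S`, and the word is built with the standard substitution L → LS, S → L rather than with the crate's cut-and-project generator.

```rust
use std::collections::HashSet;

/// Fibonacci word by substitution L -> LS, S -> L, truncated to `len` tiles.
pub fn fibonacci_word(len: usize) -> Vec<u8> {
    let mut word = vec![b'L'];
    while word.len() < len {
        word = word
            .iter()
            .flat_map(|&t| if t == b'L' { &b"LS"[..] } else { &b"L"[..] })
            .copied()
            .collect();
    }
    word.truncate(len);
    word
}

/// Number of distinct factors (contiguous substrings) of length `n >= 1`.
pub fn factor_complexity(word: &[u8], n: usize) -> usize {
    word.windows(n).collect::<HashSet<_>>().len()
}

/// True if two S tiles ever sit next to each other.
pub fn has_adjacent_s(word: &[u8]) -> bool {
    word.windows(2).any(|w| w[0] == b'S' && w[1] == b'S')
}

/// Ratio of L tiles to S tiles (assumes at least one S is present).
pub fn l_to_s_ratio(word: &[u8]) -> f64 {
    let l = word.iter().filter(|&&t| t == b'L').count() as f64;
    l / (word.len() as f64 - l)
}
```

On a 2,000-tile prefix, `has_adjacent_s` is false, `l_to_s_ratio` lands within 0.01 of φ, and `factor_complexity(&w, n)` equals `n + 1` for small `n`, which is the Sturmian bound listed above.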

This is the mathematical underpinning that the workspace's φ-substrate decisions (bgz17's 17φ/11, helix's golden-spiral hemisphere, jc::weyl's 1-D star-discrepancy) inherit. The transcode lets the workspace cross-check those decisions against the reference algebra without depending on the upstream C build.

CLI binary

cargo build --bin qresearch --manifest-path crates/quasicryth-research/Cargo.toml
qresearch round-trip /path/to/file.txt           # default: Flat
qresearch round-trip -v cow /path/to/file.txt    # COW radix trie variant
qresearch compress -v cow in.txt out.qrs1
qresearch decompress out.qrs1 in.txt.recovered

A live run of the round-trip subcommand is recorded in the commit message of 7fed9b9.

Deliberate simplifications (NOT a production compressor)

Documented in module-level docs + README. The Rust pipeline is research-grade, NOT byte-compatible with the upstream .qm56 format:

  • Single-tier codebook — unigrams only. The Fibonacci tiling + substitution hierarchy + deep-position detection are verified against the paper's theorems via tests/paper_theorems.rs, but the bit-stream itself only encodes word-ID symbols at the unigram tier. Multi-tier n-gram encoding is a phase 5+ extension.
  • No LZMA escape stream: an OOV word is an error. The pipeline still round-trips because pipeline::build_codebook caps the unigram tier at n_unique, so every word of the input is in the codebook.
  • No 36-tiling greedy selection in the bit-stream (Fibonacci-only mode equivalent).
  • No word-level LZ77, no multi-tier unigram model, no per-level context models.
  • NOT byte-identical to the C reference output. The Rust pipeline round-trips with itself; matching the upstream .qm56 exactly would require porting hundreds of model-initialization details and is out of scope for "research and testing."

Implication: small inputs (sub-KB) currently produce >100% "compressed" output because headers + per-token spans dominate. This is expected and called out in the README. The architectural property (codebook + AC working end-to-end across both variants) is what's being demonstrated.

Crate policy

  • Standalone, zero dependency (only std)
  • Excluded from the lance-graph workspace per the helix / bgz17 / deepnsm convention
  • cargo test --manifest-path crates/quasicryth-research/Cargo.toml — 83 passing
  • cargo clippy --all-targets -- -D warnings — clean (pedantic + all)
  • cargo fmt — clean
  • Cargo.lock gitignored per helix convention

What this PR is, in one sentence

A research transcode that demonstrates the architectural variation point (FlatCodebook vs CowRadixCodebook) the workspace cares about, working end-to-end through tokenize → codebook → arithmetic-code → bytes → decode, verified by 83 tests against both variants and against the paper's five core theorems.

🤖 Generated with Claude Code



Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced quasicryth-research crate providing Quasicryth v5.6.0 compression and decompression capabilities.
    • New qresearch command-line tool supporting compress, decompress, and round-trip verification operations.
    • Two compression implementation variants available: Flat and CowRadix for different use cases.
  • Documentation

    • Comprehensive README documenting the crate, CLI usage, compression variants, and testing procedures.
    • Full test suite validating round-trip compression across various input patterns and variants.

claude added 6 commits June 4, 2026 11:33
Standalone, zero-dep research/testing crate transcoding fib.h + fib.c +
the algebraic types from qtc.h of the upstream Quasicryth v5.6.0 C
reference (Tacconelli 2026, arxiv 2603.14999, upstream
github.com/robtacconelli/quasicryth).

Scope: the algebra the paper proves theorems about, not the compressor.

What's transcoded
- types.rs        — Tile, HLevel, ParentMap, Hierarchy, DeepPositions,
                    TilingDesc (idiomatic Rust ownership; no unsafe).
- constants.rs    — PHI, INV_PHI, HIER_WORD_LENS = {2,3,5,8,13,21,34,55,
                    89,144} = F_3..F_12, MAX_HIER=10, the 36-tiling
                    descriptor table (12 golden phases + sqrt(58)-7
                    + noble-5 + sqrt(13)-3 + 18 greedy-discovered alphas
                    including the far-out alpha=0.502).
- tiling.rs       — cut-and-project (qc_word_tiling[_alpha]) + five
                    substitution-rule families (Thue-Morse, Rudin-Shapiro,
                    period-doubling, Period-5, Sanddrift).
- hierarchy.rs    — build_hierarchy (iterative deflation
                    (L,S)->super-L, L->super-S), hier_context,
                    detect_deep_positions, deep_counts.
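
The deflation rule named above, (L,S)->super-L and L->super-S, is the inverse of the substitution L -> LS, S -> L. A minimal sketch, assuming a bare two-valued `Tile` enum (the crate's Tile carries positions and word spans, which are left out here) and hypothetical function names:

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tile {
    L,
    S,
}

/// One inflation step: L -> LS, S -> L.
pub fn inflate(word: &[Tile]) -> Vec<Tile> {
    let mut out = Vec::with_capacity(word.len() * 2);
    for &t in word {
        match t {
            Tile::L => out.extend([Tile::L, Tile::S]),
            Tile::S => out.push(Tile::L),
        }
    }
    out
}

/// One deflation step: an (L, S) pair becomes a super-L, a lone L becomes
/// a super-S. Returns None when an S has no L in front of it, which is how
/// this sketch reports a word that does not deflate cleanly.
pub fn deflate(word: &[Tile]) -> Option<Vec<Tile>> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < word.len() {
        match (word[i], word.get(i + 1)) {
            (Tile::L, Some(Tile::S)) => {
                out.push(Tile::L);
                i += 2;
            }
            (Tile::L, _) => {
                out.push(Tile::S);
                i += 1;
            }
            (Tile::S, _) => return None,
        }
    }
    Some(out)
}
```

Because `deflate(&inflate(w))` returns `w` for any word, repeatedly deflating a Fibonacci word walks back down through lengths 144, 89, 55, and so on.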

What's NOT transcoded
The full v5.6 production compressor pipeline (ac.c arithmetic coding,
cb.c codebook construction, compress.c / decompress.c, tok.c
tokenization, md5.c, LZMA escape). Out of scope for "research and
testing" — the goal is verifying the workspace's phi-substrate
decisions against the reference algebra, not byte-compatibility with
the upstream compressed output.

Verification (28 tests, all passing)
- 19 unit tests covering each module's invariants
- 9 integration tests in tests/paper_theorems.rs verifying:
  * Thm 2  Fibonacci hierarchy never collapses
  * Cor 4  Period-5 collapses by level ~3.3 = log(5)/log(phi)
  * Thm 9  Golden Compensation (L:S ratio = phi at every level)
  * Thm 13/Cor 15  Aperiodic advantage grows with corpus scale
  * Sturmian factor complexity <= n+1 (Thm 7 root)
  * PV-property phi^2 = phi + 1
  * HIER_WORD_LENS = Fibonacci F_3..F_12
  * No-adjacent-S on all 36 canonical tilings

cargo clippy --all-targets -- -D warnings clean (pedantic+all). rustfmt
clean. Zero-dependency default build.

Relationship to workspace crates
- bgz17 (17*phi/11 ≈ 5/2 = octave + major third) — this crate verifies
  the non-collapse theorem that justifies phi over rational stacking
  approximations.
- helix (golden-spiral hemisphere, Fisher-Z aligned) — Sturmian
  minimality theorem here is the optimality argument for phi as the
  azimuth stride.
- jc::weyl (1-D Weyl discrepancy at N=144, N=1000) — this crate's
  qc_word_tiling exercises the same phi-stride at hierarchy scale.

Listed under root Cargo.toml `exclude` so it never enters the main
compile graph. Verified via cargo test --manifest-path
crates/quasicryth-research/Cargo.toml. Follows the helix convention:
Cargo.lock gitignored; the crate stays standalone-verifiable.

Phase 1 of the full-pipeline transcode plan. Two new modules:

- src/md5.rs (RFC 1321 / md5.c transcode, 196 LOC)
  * Md5 incremental hasher + one-shot md5() function
  * Direct port of upstream md5.c; bit-exact match
  * 8 tests covering the full RFC 1321 §A.5 test suite
    (empty, "a", "abc", "message digest", alphabet,
    alphanumeric, 80-digit long input, incremental==one-shot)

- src/tok.rs (tok.c transcode, 377 LOC, partial)
  * tokenize() — split raw bytes into Token spans with case
    separation; lowered byte stream + per-token (offset, len,
    case_flag) tracking
  * word_split() — pre-lowered byte stream → word offsets, no
    case work (lighter path)
  * apply_case() — reverse the case lowering for a token
  * TokenStream::round_trip() — the round-trip the C reference
    verifies internally via case_roundtrips
  * 12 tests covering case detection (lower/Cap/UPPER),
    round-trip on lowercase / mixed-case / punctuation / empty
    / UTF-8 high-bit; word_split byte-order preservation
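
The lower/Cap/UPPER split can be illustrated as follows. This is a sketch under assumptions: `CaseFlag`, `detect_case`, and this `apply_case` signature are hypothetical, only ASCII letters are classified, and mixed patterns are simply rejected, which may differ from what tok.rs does.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CaseFlag {
    Lower,
    Capitalized,
    Upper,
}

/// Classify a word by its ASCII letters. Returns None for mixed patterns
/// such as "hELLo", which this sketch does not try to represent.
pub fn detect_case(word: &[u8]) -> Option<CaseFlag> {
    let has_upper = word.iter().any(u8::is_ascii_uppercase);
    let has_lower = word.iter().any(u8::is_ascii_lowercase);
    let rest_has_upper = word.iter().skip(1).any(u8::is_ascii_uppercase);
    match (has_upper, has_lower, rest_has_upper) {
        (false, _, _) => Some(CaseFlag::Lower),
        (true, false, _) => Some(CaseFlag::Upper),
        (true, true, false) => Some(CaseFlag::Capitalized),
        (true, true, true) => None,
    }
}

/// Reverse the lowering: rebuild the original bytes from the lowered
/// word plus its case flag.
pub fn apply_case(lowered: &[u8], flag: CaseFlag) -> Vec<u8> {
    let mut out = lowered.to_vec();
    match flag {
        CaseFlag::Lower => {}
        CaseFlag::Capitalized => {
            if let Some(first) = out.first_mut() {
                first.make_ascii_uppercase();
            }
        }
        CaseFlag::Upper => out.make_ascii_uppercase(),
    }
    out
}
```

Bytes above 0x7F are never touched, so a UTF-8 word like "café" classifies as Lower and passes through `apply_case` unchanged.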

NOT in this phase (deferred):
- enc_case / dec_case — depend on the arithmetic coder
  (phase 3, ac.c transcode)

Total tests: 48 (was 28). +20 from md5 (8) and tok (12).

Verification:
- cargo test --manifest-path crates/quasicryth-research/Cargo.toml
  → 39 unit + 9 integration = 48 passed, 0 failed
- cargo clippy --all-targets -- -D warnings clean
  (added 4 pedantic-lint allows for legibility against upstream:
   many_single_char_names, too_many_lines, format_push_string,
   bool_to_int_with_if — all stylistic, no correctness impact)
- cargo fmt clean

Zero-dep preserved. No unsafe.

Phase 2 adds the codebook tier of the upstream compressor, in TWO
variants behind one trait — this is the architectural split the user
asked for: original-shape + COW radix trie.

New module src/codebook.rs (~700 LOC):

Codebook trait
  - n_unique / n_uni / n_bi / n_ngram(level)
  - unigram_index / bigram_index / ngram_index (forward lookups)
  - unigram_word / bigram_words / ngram_words (reverse lookups)
  - both variants satisfy Send + Sync (immutable post-construction)

CodebookSizes (port of qtc_cb_sizes_t)
  - 11 tier budgets: uni, bi, tri, fg, eg, tg, vg, tfg, ffg, efg, ofg
  - auto(nw) — 7-tier corpus-size table matching auto_codebook_sizes
    in cb.c

Variant A — FlatCodebook
  - direct port of cb.c storage shape
  - Vec<u32> per tier for forward storage + HashMap for lookup
  - sorts entries by descending frequency (with deterministic tie-break)
  - filters n-gram candidates to those whose every word is in the
    unigram codebook (matches the cb.c filtering pass)
  - per-tier budgeting matches cb.c
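
A single-tier version of that construction, as a sketch: `FlatUnigrams` and `build_unigrams` are hypothetical names, and the tie-break used here (ascending word id) is an assumption, since the commit only says the tie-break is deterministic.

```rust
use std::collections::HashMap;

/// Hypothetical single-tier codebook: `words[rank]` is the word id at
/// that rank, `index` is the forward lookup from word id to rank.
pub struct FlatUnigrams {
    pub words: Vec<u32>,
    pub index: HashMap<u32, u32>,
}

pub fn build_unigrams(word_ids: &[u32], budget: usize) -> FlatUnigrams {
    // Count occurrences of every word id.
    let mut freq: HashMap<u32, u64> = HashMap::new();
    for &id in word_ids {
        *freq.entry(id).or_insert(0) += 1;
    }
    let mut ranked: Vec<(u32, u64)> = freq.into_iter().collect();
    // Descending frequency; ties broken by ascending word id (assumed rule)
    // so the result does not depend on HashMap iteration order.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    // Keep only as many entries as the tier budget allows.
    ranked.truncate(budget);
    let words: Vec<u32> = ranked.iter().map(|&(id, _)| id).collect();
    let index: HashMap<u32, u32> = words
        .iter()
        .enumerate()
        .map(|(rank, &id)| (id, rank as u32))
        .collect();
    FlatUnigrams { words, index }
}
```

Words that fall outside the budget have no entry in `index`, which is the case the n-gram filtering pass has to guard against.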

Variant B — CowRadixCodebook
  - the architectural variant the user asked for
  - backed by CowArt: a Copy-on-Write Adaptive Radix Trie
  - three node variants: Node4 (4 children, low fan-out),
    Node16 (medium fan-out), Node256 (full byte/dword fan-out).
    Node48 omitted as a deliberate simplification — Node16 grows
    straight to Node256.
  - insert() returns a NEW root via path-copy; old roots remain
    valid for prior consumers (Arc-shared subtrees).
  - one trie per tier; reverse direction uses the same Vec storage
    as FlatCodebook (the trie owns the forward direction only).

The two variants are validated against EACH OTHER in test
cow_radix_codebook_agrees_with_flat_on_lookups: identical inputs
produce identical lookup results on unigrams and bigrams. This is
the cross-validation contract that makes the COW variant a drop-in.

COW semantics are explicitly tested in cow_art_path_copy_preserves_old_root:
the v0 root stays empty after v1/v2 inserts; v1 sees only its insert,
v2 sees both — exactly the property the workspace's append-only
substrate doctrine requires.

Tests added (8): codebook_sizes_auto_increases_with_corpus,
flat_codebook_roundtrips_{unigrams,bigrams},
cow_radix_codebook_roundtrips_{unigrams,bigrams},
cow_radix_codebook_agrees_with_flat_on_lookups,
cow_art_path_copy_preserves_old_root,
cow_art_grows_node_variants.

Total tests: 56 (was 48). +8 from codebook.

Verification:
  cargo test  → 47 unit + 9 integration = 56 passed, 0 failed
  cargo clippy --all-targets -- -D warnings  clean
    (added 3 pedantic allows: assigning_clones, single_match_else,
     only_used_in_recursion — all stylistic)
  cargo fmt  clean
  No new deps; zero-dep ethos preserved (std HashMap/Arc only).

Phase 3 adds the entropy-coding layer that wraps both codebooks.
Direct transcode of ac.c.

New module src/arith_coder.rs (~640 LOC):

Constants
  AC_PREC = 24       precision (bits)
  AC_FULL = 1 << 24  full range
  AC_HALF / AC_QTR   E2 / E3 renormalization thresholds
  AC_MAX_FREQ = 1 << 20   rescale trigger

Model256
  - adaptive 256-symbol byte alphabet (port of qtc_model_t)
  - freq[256], total; halve-on-cap rescaling (freq[i] = (f>>1) | 1)
  - cdf() writes a 257-entry cumulative table for the coder
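
A minimal sketch of such a model, using the rescale formula quoted above. The initial count of 1, the per-update increment of 32, and the `>=` trigger are assumptions of this sketch, not values taken from ac.c.

```rust
pub const AC_MAX_FREQ: u32 = 1 << 20;
/// Per-symbol increment; 32 is an assumption of this sketch.
const INCREMENT: u32 = 32;

pub struct Model256 {
    pub freq: [u32; 256],
    pub total: u32,
}

impl Model256 {
    /// Uniform start: every byte has count 1 (assumed initial value).
    pub fn new() -> Self {
        Self { freq: [1; 256], total: 256 }
    }

    /// Bump one symbol; once the total reaches the cap, halve every count.
    /// `(f >> 1) | 1` keeps each count odd and therefore at least 1.
    pub fn update(&mut self, sym: u8) {
        self.freq[usize::from(sym)] += INCREMENT;
        self.total += INCREMENT;
        if self.total >= AC_MAX_FREQ {
            self.total = 0;
            for f in self.freq.iter_mut() {
                *f = (*f >> 1) | 1;
                self.total += *f;
            }
        }
    }

    /// Cumulative table: cdf[s] is the sum of counts below s, cdf[256] the total.
    pub fn cdf(&self) -> [u32; 257] {
        let mut out = [0u32; 257];
        for i in 0..256 {
            out[i + 1] = out[i] + self.freq[i];
        }
        out
    }
}
```

The invariant the coder relies on is `cdf()[256] == total` with every count nonzero, both before and after a rescale.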

VModel (variable alphabet, Fenwick-tree accelerated)
  - port of qtc_vmodel_t — O(log n) cum_lo and find
  - fenwick tree 1-indexed under the hood; 0-indexed public API
  - rescale rebuilds the tree from halved frequencies
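
The Fenwick-tree part can be sketched independently of the coder. `Fenwick` is a hypothetical name, all symbols start at count 1 (an assumption), and `find` is the usual binary descent over the implicit tree.

```rust
/// Fenwick (binary indexed) tree over symbol counts.
/// Internally 1-indexed (`tree[0]` is unused); the public API is 0-indexed.
pub struct Fenwick {
    tree: Vec<u32>,
}

impl Fenwick {
    /// `n` symbols, each starting with count 1.
    pub fn new(n: usize) -> Self {
        let mut f = Self { tree: vec![0; n + 1] };
        for sym in 0..n {
            f.add(sym, 1);
        }
        f
    }

    /// Add `delta` to the count of `sym` in O(log n).
    pub fn add(&mut self, sym: usize, delta: u32) {
        let mut i = sym + 1;
        while i < self.tree.len() {
            self.tree[i] += delta;
            i += i & i.wrapping_neg(); // step to the next covering node
        }
    }

    /// Sum of the counts of all symbols strictly below `sym`.
    pub fn cum_lo(&self, sym: usize) -> u32 {
        let (mut i, mut sum) = (sym, 0);
        while i > 0 {
            sum += self.tree[i];
            i &= i - 1; // drop the lowest set bit
        }
        sum
    }

    /// Inverse of `cum_lo`: the symbol s with cum_lo(s) <= target < cum_lo(s + 1).
    /// Requires target < total and nonzero counts.
    pub fn find(&self, mut target: u32) -> usize {
        let n = self.tree.len() - 1;
        let mut pos = 0;
        let mut step = n.next_power_of_two();
        while step > 0 {
            let next = pos + step;
            if next <= n && self.tree[next] <= target {
                pos = next;
                target -= self.tree[next];
            }
            step >>= 1;
        }
        pos
    }
}
```

`cum_lo(n)` doubles as the total, so the decoder can map a scaled target straight to a symbol with `find` and then read its interval with two `cum_lo` calls.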

Encoder
  - 24-bit precision range coder with pending-bits underflow handling
  - encode(cum_lo, cum_hi, total) drives the (lo, hi) range
  - state machine bit-exact with ac.c:
    * E1 (hi < HALF)             output 0
    * E2 (lo >= HALF)             output 1, subtract HALF
    * E3 (lo>=QTR && hi<3*QTR)    pending++, subtract QTR
  - finish() flushes pending state and packs the bit buffer to bytes

Decoder
  - symmetric to Encoder; reads MSB-first bits from the input byte stream
  - decode_256(cdf, total): binary-search the 256-entry CDF
  - decode_v(model): VModel.find() drives Fenwick-tree symbol search
  - advance() applies the same E1/E2/E3 transitions to (lo, hi, val)
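
The E1/E2/E3 transitions can be seen end to end in a textbook integer arithmetic coder. The sketch below is not the ac.c port: it uses a static cumulative table instead of adaptive models, keeps the output as unpacked `bool` bits rather than bytes, and all names and signatures are illustrative.

```rust
const PREC: u32 = 24;
const FULL: u64 = 1 << PREC;
const HALF: u64 = FULL >> 1;
const QTR: u64 = FULL >> 2;

pub struct Encoder { lo: u64, hi: u64, pending: u32, bits: Vec<bool> }

impl Encoder {
    pub fn new() -> Self {
        Self { lo: 0, hi: FULL - 1, pending: 0, bits: Vec::new() }
    }

    /// Write `bit`, then the pending underflow bits as its complement.
    fn emit(&mut self, bit: bool) {
        self.bits.push(bit);
        for _ in 0..self.pending {
            self.bits.push(!bit);
        }
        self.pending = 0;
    }

    /// Narrow (lo, hi) to the symbol's slice [cum_lo, cum_hi) of `total`.
    pub fn encode(&mut self, cum_lo: u64, cum_hi: u64, total: u64) {
        let range = self.hi - self.lo + 1;
        self.hi = self.lo + range * cum_hi / total - 1;
        self.lo += range * cum_lo / total;
        loop {
            if self.hi < HALF {
                self.emit(false); // E1
            } else if self.lo >= HALF {
                self.emit(true); // E2
                self.lo -= HALF;
                self.hi -= HALF;
            } else if self.lo >= QTR && self.hi < 3 * QTR {
                self.pending += 1; // E3
                self.lo -= QTR;
                self.hi -= QTR;
            } else {
                break;
            }
            self.lo *= 2;
            self.hi = self.hi * 2 + 1;
        }
    }

    /// Flush enough bits to pin down a value inside the final interval.
    pub fn finish(mut self) -> Vec<bool> {
        self.pending += 1;
        let bit = self.lo >= QTR;
        self.emit(bit);
        self.bits
    }
}

pub struct Decoder<'a> { lo: u64, hi: u64, val: u64, bits: &'a [bool], pos: usize }

impl<'a> Decoder<'a> {
    pub fn new(bits: &'a [bool]) -> Self {
        let mut d = Self { lo: 0, hi: FULL - 1, val: 0, bits, pos: 0 };
        for _ in 0..PREC {
            let b = d.next_bit();
            d.val = d.val * 2 + b;
        }
        d
    }

    /// Reads past the end are treated as zero bits.
    fn next_bit(&mut self) -> u64 {
        let b = self.bits.get(self.pos).copied().unwrap_or(false);
        self.pos += 1;
        u64::from(b)
    }

    /// `cdf` has one entry more than the alphabet; the last entry is the total.
    pub fn decode(&mut self, cdf: &[u64]) -> usize {
        let total = cdf[cdf.len() - 1];
        let range = self.hi - self.lo + 1;
        let target = ((self.val - self.lo + 1) * total - 1) / range;
        let sym = cdf.partition_point(|&c| c <= target) - 1;
        self.hi = self.lo + range * cdf[sym + 1] / total - 1;
        self.lo += range * cdf[sym] / total;
        loop {
            if self.hi < HALF {
                // E1: nothing to subtract
            } else if self.lo >= HALF {
                self.lo -= HALF; self.hi -= HALF; self.val -= HALF; // E2
            } else if self.lo >= QTR && self.hi < 3 * QTR {
                self.lo -= QTR; self.hi -= QTR; self.val -= QTR; // E3
            } else {
                break;
            }
            let b = self.next_bit();
            self.lo *= 2;
            self.hi = self.hi * 2 + 1;
            self.val = self.val * 2 + b;
        }
        sym
    }
}
```

The decoder mirrors every interval update the encoder made, which is why round-trip identity holds as long as both sides use the same cumulative table for each symbol.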

High-level helpers
  - ac_enc_sym / ac_dec_sym (Model256 + update)
  - ac_enc_v   / ac_dec_v   (VModel + update)

Tests added (10):
  - model256_initial_state_is_uniform
  - model256_cdf_sums_to_total
  - vmodel_initial_state_is_uniform
  - vmodel_cum_lo_is_prefix_sum
  - vmodel_find_is_inverse_of_cum_lo
  - round_trip_256_alphabet                  — all 256 bytes
  - round_trip_repeated_byte_compresses      — 10K of one byte → strong
                                                compression + round-trip
  - round_trip_variable_alphabet             — VModel symbols 0..50
  - round_trip_pseudo_random_sequence        — 5000-byte xorshift stream
  - vmodel_round_trip_with_rescaling_pressure — forces AC_MAX_FREQ rescale

Total tests: 66 (was 56). +10 from arith_coder. All five round-trip
tests pass: encode(input) → decode produces identity, demonstrating the
coder is internally consistent (this is the load-bearing correctness
property for phase 4's compress/decompress pipeline).

Verification:
  cargo test    → 57 unit + 9 integration = 66 passed, 0 failed
  cargo clippy --all-targets -- -D warnings  clean
    (added 1 doc-only fix in codebook.rs and 1 op-style fix here)
  cargo fmt clean
  Zero-dep preserved.

Honest scope flag (will appear in README at phase 4):
  The Rust encoder/decoder round-trips with itself bit-exact.
  It is NOT guaranteed byte-identical to the C reference output — the
  C reference's output depends on multiple internal Model256/VModel
  initializations across contexts (144 per-level models, 12
  per-index models, recency caches, two-tier unigram). Matching that
  exactly is a separate engineering task out of scope for "research
  and testing." Round-trip identity within the Rust pipeline is the
  property phase 4 will verify end-to-end.

End-to-end pipeline wiring phases 1-3 into a working compress() →
decompress() round-trip for BOTH codebook variants.

New module src/pipeline.rs (~460 LOC):

Public API
  - Variant enum: Flat | CowRadix — selects which codebook backs
    the pipeline
  - compress(text: &[u8], variant) -> Result<Vec<u8>, PipelineError>
  - decompress(bytes: &[u8]) -> Result<Vec<u8>, PipelineError>
  - PipelineError: OutOfVocabulary, BadMagic, Truncated, DecodeRange

Compressed stream format (v1, "QRS1" magic):
  - magic [4] || orig_size [u64] || n_tokens [u32] || n_words [u32]
    || n_unique [u32]
  - lowered byte stream (length-prefixed)
  - per-token spans: (offset u32, len u32, case_flag u8)
  - case-flag payload (AC over Model256, length-prefixed)
  - word-ID payload (AC over VModel with codebook alphabet,
    length-prefixed; round-trip witness for the codebook variant)
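
The fixed-size header at the front of that layout could be read and written as below. This is a sketch: the `Header` and `HeaderError` types are hypothetical, and little-endian byte order is an assumption because the commit message does not state the endianness.

```rust
pub const MAGIC: [u8; 4] = *b"QRS1";

#[derive(Debug, PartialEq, Eq)]
pub struct Header {
    pub orig_size: u64,
    pub n_tokens: u32,
    pub n_words: u32,
    pub n_unique: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum HeaderError {
    BadMagic,
    Truncated,
}

impl Header {
    /// magic [4] + orig_size [8] + three u32 counters.
    pub const LEN: usize = 4 + 8 + 4 + 4 + 4;

    /// Append the header to `out` (little-endian is an assumption).
    pub fn write(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&MAGIC);
        out.extend_from_slice(&self.orig_size.to_le_bytes());
        out.extend_from_slice(&self.n_tokens.to_le_bytes());
        out.extend_from_slice(&self.n_words.to_le_bytes());
        out.extend_from_slice(&self.n_unique.to_le_bytes());
    }

    /// Parse the first `LEN` bytes; reject short input and wrong magic.
    pub fn parse(bytes: &[u8]) -> Result<Self, HeaderError> {
        let head = bytes.get(..Self::LEN).ok_or(HeaderError::Truncated)?;
        if head[..4] != MAGIC {
            return Err(HeaderError::BadMagic);
        }
        let u32_at = |i: usize| {
            let mut b = [0u8; 4];
            b.copy_from_slice(&head[i..i + 4]);
            u32::from_le_bytes(b)
        };
        let mut size = [0u8; 8];
        size.copy_from_slice(&head[4..12]);
        Ok(Self {
            orig_size: u64::from_le_bytes(size),
            n_tokens: u32_at(12),
            n_words: u32_at(16),
            n_unique: u32_at(20),
        })
    }
}
```

With this layout the header costs 24 bytes and each token span costs 9 bytes (two u32 plus one u8), before any lowered bytes or arithmetic-coded payload are counted.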

Pipeline shape
  1. tokenize(text) → TokenStream + lowered byte stream + case flags
  2. Intern token byte slices → word_ids + unique pool
  3. Build codebook via the Codebook trait (Flat OR CowRadix)
  4. Verify every word is in the unigram tier (OutOfVocabulary fails)
  5. Encode word_ids stream via VModel + Encoder
  6. Encode case flags via Model256 + Encoder
  7. Serialize header + spans + lowered + AC payloads
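
Step 2, interning, is the piece that turns byte slices into the dense word ids the codebook and the VModel work on. A minimal sketch, assuming ids are handed out in first-seen order (the `intern` function and that ordering are illustrative, not taken from pipeline.rs):

```rust
use std::collections::HashMap;

/// Assign dense ids to token byte slices in first-seen order.
/// Returns the id stream plus the pool of unique words, where
/// `pool[id]` is the bytes of word `id`.
pub fn intern<'a>(tokens: &[&'a [u8]]) -> (Vec<u32>, Vec<&'a [u8]>) {
    let mut ids_by_word: HashMap<&'a [u8], u32> = HashMap::new();
    let mut pool: Vec<&'a [u8]> = Vec::new();
    let mut word_ids = Vec::with_capacity(tokens.len());
    for &tok in tokens {
        // A word seen for the first time gets the next free id.
        let id = *ids_by_word.entry(tok).or_insert_with(|| {
            pool.push(tok);
            (pool.len() - 1) as u32
        });
        word_ids.push(id);
    }
    (word_ids, pool)
}
```

Mapping every id back through `pool` reproduces the token sequence, and `pool.len()` is the n_unique value the header records.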

Deliberate simplifications (documented in module-level doc + README)
  - SINGLE-TIER codebook (unigrams only). The Fibonacci tiling
    + substitution hierarchy + deep-position detection from phase 1
    remain verified-against-paper-theorems via tests/paper_theorems.rs,
    but the bit-stream itself is single-tier. Multi-tier n-gram
    encoding is a phase 5+ extension.
  - NO LZMA escape stream (OOV → error). Reference C compressor has
    a parallel LZMA stream for OOV words.
  - NO multi-tile selection (the 36-tiling greedy engine isn't
    wired into the bit-stream).
  - NOT byte-identical to the C reference output. Round-trip
    correctness within the Rust pipeline is the property tested;
    byte-compat with the upstream .qm56 is out of scope.

Tests added (10):
  - round_trips_empty
  - round_trips_simple_lowercase           — "the quick brown fox..."
  - round_trips_mixed_case                 — "Hello WORLD foo Bar..."
  - round_trips_punctuation_and_newlines   — "Hi, world!\nFoo bar..."
  - round_trips_repeated_phrase            — 2000-byte cyclic phrase
  - round_trips_pseudo_random_text         — 500 random English words
  - round_trips_utf8_high_bit              — "café naïve façade"
  - variants_produce_same_decompressed_output
                                            — Flat and COW agree
  - bad_magic_is_rejected
  - truncated_stream_is_rejected

Every round-trip test runs against BOTH variants — the assert_round_trips
helper iterates Variant::{Flat, CowRadix} and verifies compress→
decompress is identity for both.

Bug caught during phase 4 (recorded for posterity): initial
implementation conflated two distinct "lowered" byte streams — the
full TokenStream.lowered vs a per-unique-word pool built during
interning. Token spans index into the former; I was indexing them
into the latter. Fixed by serializing TokenStream.lowered directly
and treating the per-unique pool as a build-only intermediate.

Total tests: 76 (was 66). +10 from pipeline (8 round-trip + 2 error-path tests).

Verification:
  cargo test    → 67 unit + 9 integration = 76 passed, 0 failed
  cargo clippy --all-targets -- -D warnings  clean
    (added 1 doc allow: doc_lazy_continuation)
  cargo fmt clean
  Zero-dep preserved. No unsafe. Stable Rust.

Phase 5 (integration tests) + Phase 6 (CLI binary), bundled.

Phase 5 — cross-variant integration tests
=========================================

New test file tests/round_trip.rs (7 tests):
  - variants_agree_on_long_natural_text — 600-char Fibonacci-theory
    paragraph round-trips under BOTH variants AND the decompressed
    outputs are identical
  - round_trip_at_5kb_scale — cyclic phrase to 5 KB, both variants
  - round_trip_single_word
  - round_trip_only_whitespace
  - round_trip_mixed_punctuation_lines (parens, hyphens, semicolons,
    quotes, tabs)
  - round_trip_repeated_uppercase_word
  - cross_variant_independence — compress with Flat, decompress;
    compress with CowRadix, decompress; both equal original.
    (Compressed bytes between variants MAY differ; decoded output
    MUST match.)

This is the architectural property the codebook trait contract
guarantees and the workspace's substrate doctrine requires: the
COW radix trie variant is a drop-in alternative to the flat
storage variant at the compress/decompress boundary.

Phase 6 — CLI binary
====================

New src/bin/qresearch.rs (~170 LOC):
  qresearch compress   [-v flat|cow] <input> <output>
  qresearch decompress <input> <output>
  qresearch round-trip [-v flat|cow] <input>
  qresearch --help / -h

Standard library only. Returns ExitCode::SUCCESS / ExitCode::FAILURE
with clean error messages on read/write/codec failures. The
`round-trip` subcommand reports compression ratio AND verifies
identity for quick validation on arbitrary text files.

Live test:
  $ echo "The Fibonacci substitution..." > /tmp/sample.txt
  $ qresearch round-trip /tmp/sample.txt
  round-trip OK: 95 bytes → 329 compressed (346.32%) → identical, variant=Flat
  $ qresearch round-trip -v cow /tmp/sample.txt
  round-trip OK: 95 bytes → 329 compressed (346.32%) → identical, variant=CowRadix

(The >100% ratio on 95-byte inputs is expected: v1 simplifications
mean headers + per-token spans dominate at small sizes. The C
reference's per-byte overhead amortizes over much larger inputs
and uses multi-tier n-grams + LZMA escape + word-LZ77 to get
≤25% on enwik9. The Rust pipeline here demonstrates correctness,
not benchmark-competitive compression.)

README rewrite
==============

New README.md (180 lines) documents:
  - the 7-phase transcode map (which C file → which Rust module)
  - test counts per phase (total: 83)
  - what's NOT byte-compatible with the upstream qm56 format
  - CLI usage examples
  - both codebook variants compared in a table
  - the compressed stream format (v1 QRS1 magic) field by field
  - relationships to bgz17 / helix / jc::weyl in the workspace
  - paper-theorem verification list (Thm 2, Cor 4, Thm 9, Thm 13/Cor 15,
    Sturmian minimality, PV property)

Final totals
============

  Tests: 83 (was 76)
    - 67 unit (no change)
    - 9 paper-theorem integration
    - 7 cross-variant integration (NEW)

  Verification:
    cargo test                                    → 83 passed, 0 failed
    cargo clippy --all-targets -- -D warnings     clean
    cargo fmt                                     clean
    cargo build --bin qresearch                   builds, CLI exercised
  Zero dependencies. No unsafe. Stable Rust.

Full crate inventory
====================

  Modules     LOC   Role
  ─────       ───   ────
  types         97   Tile, HLevel, ParentMap, Hierarchy, DeepPositions
  constants    192   PHI, INV_PHI, MAX_HIER, HIER_WORD_LENS, 36 tilings
  tiling       388   cut-and-project + 5 substitution-rule families
  hierarchy    308   build_hierarchy, hier_context, detect_deep_positions
  md5          196   RFC 1321 (~85 LOC C transcoded)
  tok          377   tokenize, word_split, apply_case, TokenStream
  codebook     744   Codebook trait + FlatCodebook + CowRadixCodebook +
                     CowArt (ART with Node4/Node16/Node256, path-copy)
  arith_coder  640   Model256, VModel (Fenwick), Encoder, Decoder
  pipeline     460   compress, decompress, Variant, PipelineError
  bin/qresearch 170  CLI (compress/decompress/round-trip)
  tests/...    310   paper_theorems + round_trip integration
  lib + README 280

  Total: ~4,160 LOC Rust (was ~1,300 after phase 0).
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

📥 Commits

Reviewing files that changed from the base of the PR and between 7fed9b9 and bd628e3.

📒 Files selected for processing (5)
  • Cargo.toml
  • crates/quasicryth-research/Cargo.toml
  • crates/quasicryth-research/src/codebook.rs
  • crates/quasicryth-research/src/tiling.rs
  • crates/quasicryth-research/tests/paper_theorems.rs
📝 Walkthrough

This pull request adds a complete standalone Rust crate (crates/quasicryth-research) implementing an algebraic quasi-crystalline tiling system with adaptive arithmetic-coded text compression. The crate transcodes a research prototype including tiling generators, substitution hierarchies, tokenization with case recovery, and round-trip compression verification.

Changes

Quasicryth-Research Algebraic Transcode

| Layer | File(s) | Summary |
|---|---|---|
| Workspace Integration and Core Data Structures | Cargo.toml, crates/quasicryth-research/.gitignore, crates/quasicryth-research/Cargo.toml, crates/quasicryth-research/README.md, src/types.rs | Registers the new crate in the workspace and establishes fundamental types: Tile records tiling positions and word spans; Hierarchy stores multi-level deflation structures; TilingDesc parameterizes cut-and-project generators; DeepPositions holds n-gram entry legality. |
| Mathematical Constants and Tiling Descriptors | src/constants.rs | Defines φ (golden ratio) and inverse, Fibonacci word-length array, and 36-element canonical tiling descriptor array generated from φ iterates and greedy-discovered irrational alpha values. |
| Tiling Generation (Fibonacci, Periodic, and Substitution) | src/tiling.rs | Implements cut-and-project generators (Fibonacci, arbitrary alpha), substitution families (Thue-Morse, Rudin-Shapiro, period-doubling, period-5, sanddrift), and invariant enforcement (no adjacent-S via merging or direct construction). |
| Substitution Hierarchy and Deep-Position Detection | src/hierarchy.rs | Builds multi-level deflation hierarchies with parent-pointer maps, computes bounded 3-bit hierarchy contexts, and detects deep n-gram entry points by validating ancestor-chain leftmost-child constraints and word-span coverage. |
| Tokenization, Case Separation, and Word Splitting | src/tok.rs | Tokenizes text into lowered-byte tokens with per-token case flags (lower/first-cap/ALL-CAPS), reconstructs original casing, and provides word-offset/length extraction for compression input. |
| Adaptive Arithmetic Coder (24-bit Range Coder) | src/arith_coder.rs | Implements 24-bit range-coder pair with two adaptive models: fixed 256-symbol (Model256) and variable-alphabet (VModel with Fenwick tree); both support periodic rescaling and round-trip encode/decode with symbol helpers. |
| MD5 Hashing | src/md5.rs | Provides RFC 1321 MD5 hasher with incremental buffering, 64-round block transform, and one-shot convenience function. |
| Multi-Tier Codebook Construction (Flat and COW Radix Tree) | src/codebook.rs | Defines Codebook trait with two implementations: FlatCodebook (hash maps + vectors) and CowRadixCodebook (copy-on-write adaptive radix tries), both mapping n-gram indices to frequencies within per-tier budgets. |
| End-to-End Compression and Decompression Pipeline | src/pipeline.rs | compress tokenizes, interns words, builds codebooks, and serializes header+byte-pool+spans+AC-encoded case-flags/word-IDs; decompress reconstructs case and validates word identity via round-trip decoding. |
| CLI Binary and Public API | src/lib.rs, src/bin/qresearch.rs | Exports public API surface (tiling, hierarchy, pipeline, tokenization, constants); CLI parses compress/decompress/round-trip subcommands, reads/writes files, computes ratios, and reports mismatch diagnostics. |
| Paper Theorems and Round-Trip Tests | tests/paper_theorems.rs, tests/round_trip.rs | Integration tests validate golden vs. periodic hierarchy collapse, L:S ratio φ-convergence, aperiodic advantage growth, Sturmian factor complexity, no-adjacent-S invariant, Fibonacci identities, and compression round-trips across input patterns and codebook variants. |

Sequence Diagram

sequenceDiagram
    participant User
    participant qresearch as qresearch CLI
    participant pipeline as Pipeline
    participant tokenizer as Tokenizer
    participant codebook as Codebook
    participant arith as Arithmetic Coder
    User->>qresearch: compress input.txt -v flat
    qresearch->>pipeline: compress(data, Flat)
    pipeline->>tokenizer: tokenize(text)
    tokenizer->>tokenizer: lowercase + case_flags
    pipeline->>codebook: intern words
    pipeline->>codebook: build FlatCodebook
    pipeline->>arith: encode case_flags
    pipeline->>arith: encode word_ids
    arith-->>pipeline: compressed bytes
    pipeline-->>qresearch: output stream
    qresearch-->>User: compressed file
    User->>qresearch: decompress compressed.qrs
    qresearch->>pipeline: decompress(data)
    pipeline->>arith: decode case_flags
    pipeline->>arith: decode word_ids
    arith-->>pipeline: decoded tokens
    pipeline->>tokenizer: apply_case
    tokenizer-->>pipeline: original text
    pipeline-->>qresearch: output bytes
    qresearch-->>User: original file

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title directly and accurately summarizes the primary addition: a new Quasicryth research crate with two codebook implementations (flat and Copy-on-Write radix trie), matching the ~4,160 LOC transcode described in the objectives. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7fed9b9f19


Comment on lines +512 to +516
    Self::Node256 { children, .. } => {
        if key < 256 {
            children[key as usize] = Some(child);
        }
        return;


P2: Store COW trie keys above 255

The COW radix trie is keyed by u32 word IDs, but once a node has grown to Node256 this branch silently ignores any child key >= 256. A corpus with 257 distinct alphabetic tokens triggers this in the unigram trie: Variant::CowRadix drops word ID 256 during codebook construction, so compress reports OutOfVocabulary even though the flat variant round-trips the same input and the codebook was sized to include every unique word.
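The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the crate's code: `FixedNode` is a hypothetical stand-in for the reviewed `Node256` branch, with `u32` values in place of child pointers, and a `BTreeMap` serves as the contrasting sparse store.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the reviewed `Node256`: one slot per byte value.
struct FixedNode {
    children: Vec<Option<u32>>, // always 256 slots
}

impl FixedNode {
    fn new() -> Self {
        Self { children: vec![None; 256] }
    }

    /// Mirrors the reviewed branch: keys >= 256 are ignored without an error.
    fn put_child(&mut self, key: u32, child: u32) {
        if key < 256 {
            self.children[key as usize] = Some(child);
        }
    }

    fn child(&self, key: u32) -> Option<u32> {
        self.children.get(key as usize).copied().flatten()
    }
}

fn main() {
    let mut fixed = FixedNode::new();
    let mut sparse: BTreeMap<u32, u32> = BTreeMap::new();

    // 257 distinct word IDs: 0..=256.
    for word_id in 0..257u32 {
        fixed.put_child(word_id, word_id);
        sparse.insert(word_id, word_id);
    }

    assert_eq!(fixed.child(255), Some(255)); // last key that fits
    assert_eq!(fixed.child(256), None); // silently lost
    assert_eq!(sparse.get(&256), Some(&256)); // sparse map keeps it
    println!("fixed node lost key 256: {}", fixed.child(256).is_none());
}
```

Because the insert returns nothing, the caller has no way to notice the loss until a later lookup fails, which matches the reported `OutOfVocabulary` symptom.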


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (4)
crates/quasicryth-research/tests/round_trip.rs (1)

45-55: ⚡ Quick win

Add an explicit empty-input round-trip case.

Current cases are good, but a zero-byte payload is a common framing edge and worth pinning with a dedicated test.

Suggested test
+#[test]
+fn round_trip_empty_input() {
+    round_trip(b"", Variant::Flat);
+    round_trip(b"", Variant::CowRadix);
+}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/quasicryth-research/tests/round_trip.rs` around lines 45 - 55, Add a
dedicated zero-byte payload test that calls the existing test helper round_trip
with an empty slice for both variants to pin the framing edge-case; implement a
new #[test] fn (e.g., round_trip_empty_input) that invokes round_trip(b"",
Variant::Flat) and round_trip(b"", Variant::CowRadix) so both code paths are
exercised.
crates/quasicryth-research/src/md5.rs (1)

158-170: 💤 Low value

Optional: bulk-copy optimization for update().

The byte-by-byte loop is correct but suboptimal for larger inputs. For a research crate this is acceptable, but if you later need better throughput, consider copying full chunks via copy_from_slice when the buffer is empty and data contains complete blocks.

♻️ Sketch of bulk-copy approach
pub fn update(&mut self, mut data: &[u8]) {
    let mut idx = (self.count & 63) as usize;
    self.count = self.count.wrapping_add(data.len() as u64);

    // Fill partial buffer first
    if idx != 0 {
        let fill = (64 - idx).min(data.len());
        self.buffer[idx..idx + fill].copy_from_slice(&data[..fill]);
        idx += fill;
        data = &data[fill..];
        if idx == 64 {
            transform(&mut self.state, &self.buffer);
            idx = 0;
        }
    }
    // Process full blocks directly
    while data.len() >= 64 {
        let block: &[u8; 64] = data[..64].try_into().unwrap();
        transform(&mut self.state, block);
        data = &data[64..];
    }
    // Buffer remainder
    self.buffer[..data.len()].copy_from_slice(data);
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/quasicryth-research/src/md5.rs` around lines 158 - 170, The update()
method currently copies input one byte at a time which is correct but slow;
refactor update(&mut self, data: &[u8]) to handle bulk copies: compute idx =
(self.count & 63) as usize and increment count, first fill a partial buffer if
idx != 0 using slice copy_from_slice, call transform(&mut self.state,
&self.buffer) if that fills to 64, then process any complete 64-byte blocks
directly by taking 64-byte slices (convert to &[u8;64] for transform) in a loop,
and finally copy any remaining tail into self.buffer; keep the same semantics
for self.count, self.buffer and transform() calls and ensure bounds/slice
lengths are handled with try_into()/unwrap or appropriate checks.
crates/quasicryth-research/src/pipeline.rs (1)

205-210: 💤 Low value

Slice indexing may panic on malformed compressed input.

If a malformed/corrupted compressed stream contains offset + len values that exceed lowered_pool.len(), line 207 will panic. For a research crate this is acceptable, but consider adding bounds validation for robustness:

if (offset + len) as usize > lowered_pool.len() {
    return Err(PipelineError::Truncated);
}

This is a minor hardening suggestion since the crate is documented as research-grade.
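A slightly more defensive variant of that check is sketched below. It assumes `offset` and `len` are `u32` (the review does not state their types) and uses a hypothetical `SliceError` in place of the crate's `PipelineError`. Widening before the addition avoids overflow in the sum itself, and `slice::get` turns an out-of-range span into `None` instead of a panic.

```rust
/// Hypothetical error type standing in for `PipelineError::Truncated`.
#[derive(Debug, PartialEq)]
enum SliceError {
    Truncated,
}

/// Returns `pool[offset..offset + len]`, or `Truncated` if the span
/// does not fit inside `pool`.
fn checked_span(pool: &[u8], offset: u32, len: u32) -> Result<&[u8], SliceError> {
    let start = offset as usize;
    // Widen first, then add with an overflow check (relevant on 32-bit targets).
    let end = start
        .checked_add(len as usize)
        .ok_or(SliceError::Truncated)?;
    // `get` returns None for any out-of-bounds range instead of panicking.
    pool.get(start..end).ok_or(SliceError::Truncated)
}

fn main() {
    let pool = b"hello world";
    assert_eq!(checked_span(pool, 6, 5), Ok(&b"world"[..]));
    assert_eq!(checked_span(pool, 6, 6), Err(SliceError::Truncated));
    assert_eq!(
        checked_span(pool, u32::MAX, u32::MAX),
        Err(SliceError::Truncated)
    );
}
```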

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/quasicryth-research/src/pipeline.rs` around lines 205 - 210, The loop
over spans reads slices from lowered_pool using (offset + len) and can panic if
the compressed input is malformed; in the loop that iterates spans (the block
referencing lowered_pool, apply_case, and out.extend_from_slice), validate that
(offset + len) as usize <= lowered_pool.len() before slicing and return an
Err(PipelineError::Truncated) (or appropriate error) when the check fails; this
prevents out-of-bounds access while keeping the rest of the logic (apply_case
and extending out) unchanged.
crates/quasicryth-research/src/bin/qresearch.rs (1)

80-80: 💤 Low value

Division by zero for empty input files.

If the input file is empty, data.len() is 0. The f64 division does not panic, but the ratio comes out as infinity (or NaN if compressed is also empty). Consider guarding:

let ratio = if data.is_empty() {
    0.0
} else {
    100.0 * compressed.len() as f64 / data.len() as f64
};

Same applies to line 145 in run_round_trip.
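Since the same expression appears in two places, one option is a small shared helper. The sketch below is illustrative only; `ratio_percent` is a hypothetical name, not a function from the crate.

```rust
/// Compressed size as a percentage of the original size.
/// Returns 0.0 for an empty original instead of infinity or NaN.
fn ratio_percent(compressed_len: usize, original_len: usize) -> f64 {
    if original_len == 0 {
        0.0
    } else {
        100.0 * compressed_len as f64 / original_len as f64
    }
}

fn main() {
    assert_eq!(ratio_percent(50, 200), 25.0);
    assert_eq!(ratio_percent(16, 0), 0.0);

    // What the unguarded expression yields for an empty original:
    let zero = 0.0_f64;
    assert!((100.0 * 16.0_f64 / zero).is_infinite());
    assert!((zero / zero).is_nan());
}
```

Whether 0.0 is the right value to report for an empty file is a presentation choice; the point is only that the printed ratio stays finite.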

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/quasicryth-research/src/bin/qresearch.rs` at line 80, The ratio
calculation uses data.len() as divisor and will divide by zero for empty inputs;
update the computation (the line that sets let ratio = 100.0 * compressed.len()
as f64 / data.len() as f64) to guard for empty data (e.g., set ratio = 0.0 when
data.is_empty()) and apply the same guarded logic inside the run_round_trip
function where a similar ratio is computed; change only the ratio expression to
a conditional based on data.is_empty() using the existing variables compressed
and data.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Cargo.toml`:
- Around line 50-56: Update the Cargo.toml crate description to accurately
reflect that this research crate includes more than just the algebraic core:
mention the presence of arithmetic coding (arith_coder.rs —
Model256/VModel/Encoder/Decoder), tokenization (tok.rs), codebook construction
(codebook.rs: FlatCodebook/CowRadixCodebook), MD5 hashing (md5.rs), the
compression pipeline (pipeline.rs: compress/decompress), and the qresearch CLI
binary; replace the incorrect "only the algebraic core" phrasing with a concise
note that the crate implements a simplified full pipeline relative to the
upstream reference (single-tier unigram encoding, no multi-level n-grams, no
LZMA escape) rather than claiming it omits these components.

In `@crates/quasicryth-research/Cargo.toml`:
- Around line 9-12: The top-level comment in Cargo.toml incorrectly states the
crate is "Algebraic core only" and omits features actually implemented; update
that comment to reflect the real scope by listing included components such as
arithmetic coding (arith_coder.rs), tokenization (tok.rs), and codebook
construction (codebook.rs) and remove the claim that those live only in the
upstream C reference; keep the note about default zero-deps if still true and
ensure the wording matches the README/PR summary about what this crate provides.

In `@crates/quasicryth-research/src/codebook.rs`:
- Around line 504-516: The Node256 branch currently drops keys >= 256; update
the ART node representation so Node256 no longer assumes keys fit in a 0..255
slot: replace the fixed array children in the Node256 variant with a
HashMap<u32, Arc<ArtNode>> (or another dynamic map) and then update all helpers
that touch it — specifically change put_child, child (lookup), replace_child,
and grow_to_256 to insert/lookup/replace entries in that map and ensure
grow_to_256 moves all existing Node16 children into the new HashMap (preserving
keys >= 256 instead of dropping them); keep Node16, grow_to_256, Node256,
put_child, child, and replace_child identifiers to locate the changes.

In `@crates/quasicryth-research/src/tiling.rs`:
- Around line 366-370: The test sanddrift_generates_nonempty currently only
checks non-empty output; add the missing invariant assertion by calling
verify_no_adjacent_s on the tiles produced by sanddrift_tiling(100) (i.e., after
the existing assert!(!tiles.is_empty()) add
assert!(verify_no_adjacent_s(&tiles))). This uses the existing helper
verify_no_adjacent_s to ensure no adjacent 'S' tiles and keeps the test
consistent with other generator tests.
- Around line 239-281: sanddrift_tiling currently emits raw symbols with SS
pairs (from L→LSSL) and tiles them directly, violating the module invariant
checked by verify_no_adjacent_s; fix by routing the produced symbol sequence
through the existing symbols_to_tiles merger (or otherwise performing the SS→L
merge) instead of directly constructing Tile entries—specifically, in
sanddrift_tiling replace the direct tiling loop that builds Tile { wpos, nwords,
is_l } from seq with a call to symbols_to_tiles(seq[..need]) (or an equivalent
merge step) so adjacent S symbols are collapsed to L as other generators expect,
or else update the module docstring and add sanddrift_tiling to the explicit
exception list if you intend to keep adjacent S behavior.

In `@crates/quasicryth-research/tests/paper_theorems.rs`:
- Around line 171-176: The current test assertion for Sturmian bound only checks
factors.len() <= n + 1 which can hide regressions; change the check in the test
to assert exact equality (factors.len() == n + 1) for the given long prefix and
small n, updating the assertion message to reflect expected equality and include
n and actual factors.len() for debugging; locate the assertion using the symbols
factors and n in this test and replace the <= check with an equality check (and
adjust the formatted message accordingly).
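The difference between the `<=` and `==` forms of the Sturmian check in the last item can be seen on a small example. The sketch below does not use the crate's generators. It assumes the standard Fibonacci substitution (L→LS, S→L) as the Sturmian word, and `fibonacci_word` and `factor_count` are hypothetical helper names.

```rust
use std::collections::HashSet;

/// Builds a prefix of the Fibonacci word (at least `min_len` symbols)
/// by iterating the substitution L -> LS, S -> L.
fn fibonacci_word(min_len: usize) -> Vec<u8> {
    let mut word = vec![b'L'];
    while word.len() < min_len {
        let mut next = Vec::with_capacity(word.len() * 2);
        for &c in &word {
            if c == b'L' {
                next.extend_from_slice(b"LS");
            } else {
                next.push(b'L');
            }
        }
        word = next;
    }
    word
}

/// Number of distinct length-`n` factors (contiguous substrings) of `word`.
fn factor_count(word: &[u8], n: usize) -> usize {
    word.windows(n).collect::<HashSet<_>>().len()
}

fn main() {
    let word = fibonacci_word(1000);
    for n in 1..=10 {
        // Sturmian: exactly n + 1 distinct factors of length n.
        assert_eq!(factor_count(&word, n), n + 1, "n = {n}");
    }

    // A periodic stream passes `<= n + 1` but fails `== n + 1`.
    let periodic: Vec<u8> = b"LS".iter().copied().cycle().take(1000).collect();
    assert_eq!(factor_count(&periodic, 5), 2);
}
```

The periodic stream has only two length-5 factors, so the looser bound would accept it even though it is not Sturmian.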


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 7a470a2b-f70f-4275-b261-b65b5a435c8a

📥 Commits

Reviewing files that changed from the base of the PR and between ef35ff1 and 7fed9b9.

📒 Files selected for processing (17)
  • Cargo.toml
  • crates/quasicryth-research/.gitignore
  • crates/quasicryth-research/Cargo.toml
  • crates/quasicryth-research/README.md
  • crates/quasicryth-research/src/arith_coder.rs
  • crates/quasicryth-research/src/bin/qresearch.rs
  • crates/quasicryth-research/src/codebook.rs
  • crates/quasicryth-research/src/constants.rs
  • crates/quasicryth-research/src/hierarchy.rs
  • crates/quasicryth-research/src/lib.rs
  • crates/quasicryth-research/src/md5.rs
  • crates/quasicryth-research/src/pipeline.rs
  • crates/quasicryth-research/src/tiling.rs
  • crates/quasicryth-research/src/tok.rs
  • crates/quasicryth-research/src/types.rs
  • crates/quasicryth-research/tests/paper_theorems.rs
  • crates/quasicryth-research/tests/round_trip.rs

Comment thread Cargo.toml Outdated
Comment thread crates/quasicryth-research/Cargo.toml Outdated
Comment thread crates/quasicryth-research/src/codebook.rs Outdated
Comment thread crates/quasicryth-research/src/tiling.rs
Comment on lines +366 to +370
#[test]
fn sanddrift_generates_nonempty() {
let tiles = sanddrift_tiling(100);
assert!(!tiles.is_empty());
}


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Test does not verify the no-adjacent-S invariant.

Unlike tests for other generators (thue_morse_alternates_at_low_indices, rudin_shapiro_generates_nonempty, period_doubling_generates_nonempty), this test omits the verify_no_adjacent_s assertion. Once the bug in sanddrift_tiling is fixed, add the invariant check here.

💚 Proposed fix
     #[test]
     fn sanddrift_generates_nonempty() {
         let tiles = sanddrift_tiling(100);
         assert!(!tiles.is_empty());
+        assert!(verify_no_adjacent_s(&tiles));
     }

Comment thread crates/quasicryth-research/tests/paper_theorems.rs Outdated
Addresses PR #461 review feedback.

LOAD-BEARING BUG (codex P2 / coderabbit Critical):
  CowArt silently dropped keys ≥ 256
==================================================

The original three-variant ART (Node4 / Node16 / Node256) was
byte-keyed at the leaf level — Node256 only handled values 0..255.
With u32 word-IDs, any corpus of 257+ unique words would silently
lose entries from the unigram trie. Result:
  - Variant::Flat round-tripped correctly (HashMap-based)
  - Variant::CowRadix produced OutOfVocabulary on word_id ≥ 256
    even though the codebook was sized to include every unique word

Tests masked the bug because they used 5-word vocabularies.

Fix: replace the three-variant ArtNode enum with a single
sparse-children node:

    struct ArtNode {
        children: BTreeMap<u32, Arc<ArtNode>>,
        leaf: Option<u32>,
    }

  - Loses the ART byte-keyed Node4/Node16/Node256 branch-free
    optimization. The optimization assumed byte keys; u32 keys
    don't fit it without per-byte decomposition (which would be a
    much bigger refactor).
  - Gains correctness for arbitrary u32 keys including word IDs
    ≥ 256 (which is most real text).
  - Preserves the COW property — every insert returns a new root
    via path-copy, prior roots stay valid. This is the
    architectural point of the variant, and it's what the
    workspace's append-only doctrine needs.
  - BTreeMap (not HashMap) for deterministic iteration order,
    useful for any future serialization or cross-impl comparison.

Two regression tests added so this bug can't recur silently:
  - cow_art_handles_arbitrary_u32_keys
      Inserts 302 keys spanning 0..300 + 1_000_000 + u32::MAX;
      verifies every one round-trips. The original implementation
      would have silently dropped every key ≥ 256, including
      1_000_000 and u32::MAX.
  - cow_radix_codebook_handles_large_vocabulary
      Builds a 300-unique-word codebook via CowRadixCodebook; asserts
      every word ID (including 256..299) is findable via
      unigram_index(). This is the exact codex P2 scenario.

Total tests: 84 (was 83). +2 from the regression tests, +1 from a
renamed-and-tightened existing test.

SECONDARY FINDINGS
==================

coderabbit Critical — sanddrift_tiling docstring:
  The module docstring claimed all generators satisfy the no-adjacent-S
  invariant, but sanddrift's substitution L→LSSL produces SS pairs by
  design (LL forbidden, not SS). The upstream gen_sanddrift_tiles in
  fib.c also bypasses the SS→L merge for the same reason — preserving
  the substitution structure.
  Fix: update module docstring to name sanddrift as the documented
  exception; rename + strengthen the sanddrift test to assert the
  ACTUAL invariant (LL forbidden), not the wrong one (no-adjacent-S).
  Behaviour unchanged — matches the C reference.

coderabbit Minor — Cargo.toml comments misrepresent crate scope:
  Both workspace Cargo.toml and crate Cargo.toml had stale "algebraic
  core only" comments from phase 0. Updated to reflect the full
  pipeline shipped in phases 1-6 (arithmetic coder, tokenization,
  codebook variants, compress/decompress).

coderabbit Minor — Sturmian assertion too loose:
  tests/paper_theorems.rs::sturmian_factor_complexity_is_n_plus_1
  asserted `factors.len() <= n + 1`, which would pass for degenerate
  (sub-Sturmian, periodic) streams. Sturmian minimality (Paper §4.10,
  Thm 7 corollary) requires EXACTLY n+1 distinct length-n factors.
  Strengthened to assert_eq! with a clearer error message.
  This catches drift toward either degenerate or super-Sturmian
  streams.

Verification:
  cargo test --manifest-path crates/quasicryth-research/Cargo.toml
    → 68 unit + 9 paper-theorem + 7 cross-variant = 84 passed
  cargo clippy --all-targets -- -D warnings  clean
  cargo fmt  clean
  Zero deps preserved. No unsafe.
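The path-copy behaviour described in the commit message can be sketched as follows. This is an illustration under assumptions, not the merged code: the `ArtNode` fields follow the struct quoted above, but the free functions `insert` and `get`, and the use of a `&[u32]` slice as the key path, are hypothetical.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

#[derive(Clone, Default)]
struct ArtNode {
    children: BTreeMap<u32, Arc<ArtNode>>,
    leaf: Option<u32>,
}

/// Copy-on-write insert: clones only the nodes along `key`'s path and
/// returns a new root. The old root and all untouched subtrees are shared.
fn insert(root: &Arc<ArtNode>, key: &[u32], value: u32) -> Arc<ArtNode> {
    // Shallow clone: the map is copied, the child Arcs are only ref-counted.
    let mut node = (**root).clone();
    match key.split_first() {
        None => node.leaf = Some(value),
        Some((&head, rest)) => {
            let child = node.children.get(&head).cloned().unwrap_or_default();
            node.children.insert(head, insert(&child, rest, value));
        }
    }
    Arc::new(node)
}

fn get(root: &ArtNode, key: &[u32]) -> Option<u32> {
    match key.split_first() {
        None => root.leaf,
        Some((head, rest)) => get(root.children.get(head)?, rest),
    }
}

fn main() {
    let v0 = Arc::new(ArtNode::default());
    let v1 = insert(&v0, &[1_000_000], 7);
    let v2 = insert(&v1, &[u32::MAX], 9);

    // Prior roots stay valid and unchanged.
    assert_eq!(get(&v0, &[1_000_000]), None);
    assert_eq!(get(&v1, &[1_000_000]), Some(7));
    assert_eq!(get(&v1, &[u32::MAX]), None);
    // The newest root sees both keys, including ones >= 256.
    assert_eq!(get(&v2, &[1_000_000]), Some(7));
    assert_eq!(get(&v2, &[u32::MAX]), Some(9));
}
```

In this sketch each insert copies one `BTreeMap` per node on the path, so the cost grows with the fan-out of those nodes. That is the price of trading the byte-keyed node layouts for arbitrary `u32` keys.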
@AdaWorldAPI AdaWorldAPI merged commit 42d502e into main Jun 4, 2026
6 checks passed